Amdt. dated April 1, 2004

Reply to Office Action of January 16, 2004

Amendments to the Specification:

Please replace the paragraph at page 2, lines 9-16 with the following 
amended paragraph: 

As previously mentioned, each DNA molecule contains many genes. A 
gene is a specific sequence of nucleotide bases. These sequences carry the 
information required for constructing proteins. A protein is a large molecule 
formed of one or more chains of amino acids in a specific order. Order is 
determined by base sequence of nucleotides in the gene coding for the protein. 
Each protein has a unique function. [[well-defined functionality. A DNA sequence
consists of many biologically distinct regions. For the purpose of this application,
Applicants distinguish between intergenic DNA and genes. Many of the genes in
mammalian cells are "split genes". A split gene consists of coding and non-coding
sequences. The coding sequences in a gene are contained within exonic
regions (exons), that appear sequentially separated by long regions referred to as
introns.]] In a DNA molecule, there are protein-coding sequences (genes) called
"exons", and non-coding-function sequences called "introns" interspersed within
many genes. The balance of DNA sequences in the genome are other non-coding
regions or intergenic regions.

Please replace the paragraph at page 3, lines 3-25 with the following 
amended paragraph: 

Gene identification and gene discovery in newly sequenced genomic 
sequences is one of the most timely computational questions addressed by 
bioinformatics scientists. Popular gene finding systems include Glimmer,
Genemark, Genscan, Genie, GENEWISE, and Grail (See Burge, C. and Karlin, S.,
"Prediction of complete gene structures in human genomic DNA," J. Mol. Biol.,
268:78-94, 1997; Salzberg, S. et al., "Microbial gene identification using
interpolated Markov models," Nucl. Acids Res., 26(2):544-548, 1998; Xu, Y. et al.,
"Grail: A multi-agent neural network system for gene identification," Proc. of the
IEEE, 84(10):1544-1552, 1996; Kulp, D. et al., "A generalized hidden Markov
model for the recognition of human genes in DNA," in ISMB-96: Proc. Fourth Intl.
Conf. Intelligent Systems for Molecular Biology, pp. 134-141, Menlo Park, Calif.,
1996, AAAI Press; Borodovsky, M. and J.D. McIninch, "Genemark: Parallel gene
recognition for both DNA strands," Computers & Chemistry, 17(2):123-133, 1993;
and Salzberg, S. et al. eds., Computational Methods in Molecular Biology, Vol. 32
of New Comprehensive Biochemistry, Elsevier Science B.V., Amsterdam, 1998).
The annotations produced by gene finding systems have been made available to
the public. Such projects include the genomes of over thirty microbial organisms,
as well as Malaria, Drosophila, C. elegans, mouse, human chromosome 22 and
others. For instance, Glimmer has been widely used in the analysis of many
microbial genomes and has reported over 98% accuracy in prediction accuracy
(See Fraser, C. M. et al., "Genomic sequence of a Lyme disease spirochaete,
Borrelia burgdorferi," Nature 390(6660):580-586, December 1997), Genie (D.
Kulp et al. above) has been deployed in the analysis of the Drosophila genome,
and Genscan (C. Burge and S. Karlin above) was used for analysis of human
chromosome 22.

Please replace the paragraph at page 4, lines 13-23 with the following 
amended paragraph: 

On a very high level, genes in human DNA and many other organisms 
have a relatively simple structure. All eukaryotic genes, including human genes,
are thought to share a similar layout. This layout adheres to the following
"grammar" or pattern: start codon, exon, [[(intron exon)n]] (intron-exon)n, stop codon.
The start codon is a specific 3-base sequence (e.g. ATG) which signals the
beginning of the gene. Exons are the actual genetic material that code for
proteins as mentioned above. Introns are the spacer segments of DNA whose
function is not clearly understood. And finally stop codons (e.g. TAA) which
signal the end of the gene. The notation (intron-exon)n simply means that there
are n alternating intron-exon segments. Gene identification procedures have to
take into account other important issues such as polyA tail, promoters,
pseudo-genes, alternative splicing and other features.
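
The pattern described above can be illustrated with a short sketch. The single-letter segment codes (S, E, I, T) and the function name are hypothetical and are introduced here only for illustration; allowing n = 0 (a single-exon gene) is likewise an assumption, since the text does not state a lower bound for n.

```python
import re

# Hypothetical one-letter codes for gene segments (not part of the specification):
#   S = start codon, E = exon, I = intron, T = stop codon.
# The regular expression encodes: start codon, exon, (intron-exon) repeated n times, stop codon.
GENE_PATTERN = re.compile(r"SE(IE)*T")


def follows_gene_grammar(segments: str) -> bool:
    """Return True if the whole segment string matches the layout described above."""
    return GENE_PATTERN.fullmatch(segments) is not None


single_exon = follows_gene_grammar("SET")       # n = 0 (assumed to be allowed)
three_exons = follows_gene_grammar("SEIEIET")   # n = 2
dangling = follows_gene_grammar("SEIT")         # intron not followed by an exon
```

A check of this kind covers only the coarse layout; it says nothing about promoters, the polyA tail, pseudo-genes or alternative splicing.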

Please replace the paragraph at page 7, lines 3-9 with the following
amended paragraph: 

By way of background, [[ligated]] exons are the sequence regions that are

simple but still computationally mysterious mechanism of splicing that takes place 
after the DNA sequence has been transcribed into RNA. The process starts by 
spliceome proteins that recognize the splice signals, followed by a step where the 
introns are cut out (spliced out), and ending in a phase where the consecutive 
exons are "glued" together into a single sequence that is translated into a protein. 
Intuitively speaking this process is performed on an RNA "image" of the genomic 
sequence. 
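
The cut-and-glue step described above can be sketched as follows, assuming the exon coordinates are already known. The function name, the 0-based end-exclusive coordinate convention and the toy transcript are illustrative assumptions, not part of the specification.

```python
def splice(transcript: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices in order, dropping everything between them.

    Coordinates are 0-based and end-exclusive (an assumed convention).
    """
    return "".join(transcript[start:end] for start, end in sorted(exons))


# Toy RNA "image" of a genomic sequence: two exons separated by one intron.
rna = "AUGGCCGUAAGUUUCAGGAAUAA"
exon_coords = [(0, 6), (17, 23)]   # the intron occupies positions 6..16
mature = splice(rna, exon_coords)
```

The sketch models only the removal of introns and the joining of exons; recognition of the splice signals is outside its scope.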

Please replace the paragraph at page 8, lines 1-5 with the following 
amended paragraph: 

The present invention is a system for the combination of individual experts 
which is learned from data. Unlike the prior art, such a system exploits learned
dependencies between experts and forms a prediction maximally consistent with 
known gene data. Statistically, predictions of the invention system will then have 
the potential to generalize to genes undiscovered by any of the individual experts
[[refine the boundaries and verify the predictions made by experts]].

Please replace the paragraph at page 10, lines 7-18 with the following
amended paragraph: 

Bayesian network probabilistic models provide a flexible and powerful 
framework for statistical inference as well as learning of model parameters from 
data. The goal of inference is to find a distribution of [[one or more]] a random
variable in the network conditioned on evidence (known values) of other 
variables. Bayesian networks encompass efficient inference algorithms, such as 
Jensen's junction tree (Jensen, F.V., An Introduction to Bayesian Networks,
Springer-Verlag, 1995) or Pearl's message passing (Pearl, J., Probabilistic
reasoning in intelligent systems, Morgan Kaufmann, San Mateo, Calif., 1998).

Appl. No.: 09/943,579

Inside a learning loop, such algorithms may be used to efficiently estimate optimal 
values of a model's parameters from data (for instance, see Jordan, M.I. ed.,
Learning in Graphical Models, Kluwer Academic Publishers, 1998). Furthermore,
techniques exist that can optimally determine the topology of a Bayesian network
together with its parameters directly from data.

Please replace the paragraph at page 11, lines 9-18 with the following 
amended paragraph: 

Gene combiner parameters, probability tables P(Ei|Y) and P(Y), are 
learned from a training dataset of nucleotide sequences by statistically calculating 
P(Ei|Y) and P(Y) of all individual predictors Ei and labeled for ground truth Y. For
instance, a maximum likelihood (ML) estimate of these parameters for a training 
set of N nucleotides is 

N 

where e denotes the prediction of an expert system i, e ∈ {intron, exon}, and y is
the combined prediction, y ∈ {intron, exon}. # Ei = e, Y = y denotes the number of
cases in the training dataset where the prediction of expert system i is e and the
ground truth [[combined prediction]] is y. Alternative estimates of these parameters
may be obtained using MAP (maximum a posteriori) estimation.
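
A count-based estimate of this kind can be sketched as shown below. The sketch assumes the usual maximum likelihood form for discrete probability tables, P(Y = y) = #(Y = y) / N and P(Ei = e | Y = y) = #(Ei = e, Y = y) / #(Y = y); whether this matches the expression in the specification term by term is an assumption. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def ml_estimates(expert_labels, truth_labels):
    """Count-based ML estimates of P(Y) and P(Ei | Y) for a single expert i.

    Assumed form: P(Y=y) = #(Y=y) / N and P(Ei=e | Y=y) = #(Ei=e, Y=y) / #(Y=y).
    """
    n = len(truth_labels)
    y_counts = Counter(truth_labels)                       # #(Y = y)
    joint_counts = Counter(zip(expert_labels, truth_labels))  # #(Ei = e, Y = y)
    p_y = {y: count / n for y, count in y_counts.items()}
    p_e_given_y = {
        (e, y): count / y_counts[y] for (e, y), count in joint_counts.items()
    }
    return p_y, p_e_given_y


# Toy training set of N = 6 nucleotides (illustrative labels only).
expert = ["exon", "exon", "intron", "intron", "exon", "intron"]
truth = ["exon", "exon", "exon", "intron", "intron", "intron"]
p_y, p_e_given_y = ml_estimates(expert, truth)
```

Pairs (e, y) never seen in training are simply absent from the table here; a MAP estimate with a prior would assign them a small non-zero probability instead.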

Please replace the paragraph at page 14, lines 6-13 with the following 
amended paragraph: 

For that purpose, Applicants assumed that each individual expert system 
provides the following binary decision. An expert system produces a single 
labeling for every nucleotide in a sequence: E if the nucleotide is a part of an 
exon and I if it belongs to an intron [[or an intergenic region]]. Using the notation of
Applicants' models, Ei ∈ {E, I} for an expert i. Similarly, a combined decision Y is
either E or I. Parameters of each of the four above-discussed models of
Bayesian network combiners 28, 31, 40, 51 were learned using a standard
maximum likelihood estimation in the Bayesian network framework. All prediction
results were then obtained using a five-fold cross-validation. 
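
As an illustration of how per-nucleotide expert labels might be merged into a combined decision Y, the sketch below assumes the simplest possible combiner, in which the experts are conditionally independent given Y, so that P(Y | E1, ..., En) is proportional to P(Y) multiplied by the product of P(Ei | Y). This topology is an assumption made for the example and is not asserted to be any of the four combiners 28, 31, 40, 51. The function name and all probability values are hypothetical.

```python
def combine(predictions, p_y, p_e_given_y):
    """Return the label y that maximizes P(y) * prod_i P(E_i = e_i | y).

    predictions: one label per expert, each "E" or "I".
    p_e_given_y: one table per expert, keyed by (e, y).
    """
    scores = {}
    for y, prior in p_y.items():
        score = prior
        for i, e in enumerate(predictions):
            score *= p_e_given_y[i][(e, y)]
        scores[y] = score
    return max(scores, key=scores.get)


# Illustrative (made-up) parameters for two experts.
p_y = {"E": 0.3, "I": 0.7}
p_e_given_y = [
    {("E", "E"): 0.9, ("I", "E"): 0.1, ("E", "I"): 0.2, ("I", "I"): 0.8},
    {("E", "E"): 0.8, ("I", "E"): 0.2, ("E", "I"): 0.1, ("I", "I"): 0.9},
]

both_exon = combine(["E", "E"], p_y, p_e_given_y)
disagree = combine(["E", "I"], p_y, p_e_given_y)
```

When the two experts disagree, the outcome depends on the prior and on how reliable each expert is for each class, which is the kind of learned dependency a trained combiner is meant to exploit.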

Please replace the paragraph at page 14, lines 7-12 with the following 
amended paragraph: 

An exon is said to be exactly predicted 47 only if both its ending and
beginning points coincide with those of a true exon. An exon is said to be missed
57 if there is no overlap with any of the predicted exons. ME gives the
percentage of missed exons 57 whereas WE gives the percentage of wrongly or
overpredicted exons 49. To compute these two numbers (ME and WE),
Applicants look for any overlap between a true and a predicted exon. [[Wrong
exon (WE) prediction implies the prediction has no overlap with a true exon.]]
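
The exon-level measures can be sketched as below. Exons are represented as (start, end) pairs with an end-exclusive convention, ME is taken relative to the number of true exons and WE relative to the number of predicted exons; these conventions, like the function names, are assumptions for illustration and are not stated in the text.

```python
def overlaps(a, b):
    """True if the half-open intervals a = (start, end) and b share a position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_level_stats(true_exons, predicted_exons):
    """Count exact predictions and compute ME and WE as percentages.

    Assumed denominators: ME over true exons, WE over predicted exons.
    Both lists are assumed to be non-empty.
    """
    exact = sum(1 for p in predicted_exons if p in true_exons)
    missed = sum(
        1 for t in true_exons
        if not any(overlaps(t, p) for p in predicted_exons)
    )
    wrong = sum(
        1 for p in predicted_exons
        if not any(overlaps(p, t) for t in true_exons)
    )
    return {
        "exact": exact,
        "ME": 100.0 * missed / len(true_exons),
        "WE": 100.0 * wrong / len(predicted_exons),
    }


true_exons = [(10, 20), (30, 40), (50, 60), (70, 80)]
predicted_exons = [(10, 20), (32, 40), (90, 100)]
stats = exon_level_stats(true_exons, predicted_exons)
```

In this toy case one prediction is exact, one overlaps a true exon without matching both boundaries, and one has no overlap at all, so it counts toward WE.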